HEAD:_site/TH_EX/TH_EX01/TH_EX01.html <<<<<<< HEAD:_site/TH_EX/TH_EX01/TH_EX01.html
Pareto chart showing the distribution of returns by product sub-category in 2021 for superstores in United States and Singapore 2021 population pyramid using ggplot2.
Pareto chart is a popular data visualization use to reveal something like the 80-20 rule. Below chart was created using Tableau.
The purpose of the Pareto chart is to highlight the most important among a (typically large) set of factors. In market share study, Pareto charts are useful to reveal if the business is dominated by a handful of key products or customers. In general, businesses will try to avoid their businesses over-dependent by a few products or customers because if something happen to their key products or business partners, the businesses will be very badly affected.
In the nutshell, a Pareto chart is a dual axis chart. The y-axis on the left (also known as primary y-axis) shows the absolute frequency of the observations and their values are presented as bars. The y-axis on the right shows the cumulative frequency of the observed values and they are presented as dots. These dots are then joined by a smooth line.
In this exercise, we are going to create a pareto chart showing the distribution of returns by product sub-category by using a data set from a retail store group.
Figure below shows the workflow of a typical data science and analytics process and the tidyverse R packages that specially designed to support the specific tasks in the data science and analytics process.
Before we get started, it is important for us to ensure that the required R packages have been installed. For the purpose of the exercise, the follow tidyverse packages will be used:
The code chunk below is used to check if the necessary R packages are installed in R. If they have yet, then RStudio will install the missing R package(s). If are already been installed, then they will be launch in R environment.
packages =c('tidyverse','readxl', 'knitr')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T )
}
It is important note that readxl package has to be listed separately in the packages list because it is not part of the core tidyverse package.
In this exercise, superstore-2021 data set will be used. It is an MS Excel workbook. It consists of three worksheets, namely: Orders, People and Returns.
The code chunk below imports superstore-2021.xls into R environment by using reaxl.
orders <- read_xls("data/Superstore-2021.xls",
sheet = "Orders")
returns <- read_xls("data/Superstore-2021.xls",
sheet = "Returns")
After parsing the worksheet into R, it is a good practice to check the structure and content of the newly tibble data frames in RStudio by opening the data tab.
In this step, the left_join() of dplyr is used to join the returns data frame and orders data frame by using Order ID as the unique identifier.
joined_tab <- left_join(returns, orders,
by = c('Order ID' = 'Order ID'))
Next, we are going to compute the frequency count of returns by sub-category. There are two ways to complete the task. The first way is by using the group-by method and the second way is by using the count method of dplyr.
In the code chunk below, group_by() of dplyr package is used to group the orders by Sub-Category. Then, summarise() of dplyr is used to count (i.e. n()) the number of returned orders.
freq_returned <- joined_tab %>%
group_by(`Sub-Category`) %>%
summarise('Returns' = n()) %>%
ungroup()
The code chunk below shows the alternative way to derive the frequency of returns by sub-category. In this case, count() of dplyr package is used.
freq_returned <- joined_tab %>%
count(`Sub-Category`) %>%
rename(Returns = n)
By default, count() will return a new field called n to store the result. In view of this, rename() of dplyr is used to rename n to Returns.
Before we can compute the cumulative frequency, we need to sort the values in the sub-category field by the values in the Returns field. To accomplish this task, the arrange() of dplyr package is used as shown in the code chunk below.
freq_sorted <- freq_returned %>%
arrange(desc(Returns))
The sorted data frame should look similar to the screenshot below.
Out next task is to compute the cumulative frequency of returns by product sub-category. This task will be performed by using mutate() of dplyr package and cumsum() of Base R.
freq_cum <- freq_sorted %>%
mutate(cumfreq = cumsum (Returns))
There are two types of bar charts: geom_bar and geom_col. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is. Since we want the heights of the bars to represent values in the data, we choose to use geom_col in this case.
ggplot(data = freq_cum,
aes(x = reorder(`Sub-Category`,-Returns))) +
geom_col(aes(y = Returns), color = 'darkturquoise',fill = 'darkturquoise')
We can also use geom_bar() and override the default connection between geom_bar() and stat_count() by changing “stat” argument with below code:
ggplot(data=freq_cum,
aes(x = reorder(`Sub-Category`,-Returns))) +
geom_bar(aes(y = Returns), stat ="identity", color = 'darkturquoise', fill = 'darkturquoise')
geom_text are useful for labeling plots. geom_text() adds only text to the plot. geom_label() draws a rectangle behind the text. Since we still have cumulative % line and horizontal reference line to be plotted in the same chart, to keep the chart neat, geom_text() is a better choice.
geom_text(aes(y = Returns, label = Returns), size = 4,position = position_dodge(width = 0.5), vjust = -0.25)
In order to show the cumulative % of total No. of returns on the same plot with No.of returns, we need to add in a second y axis. scale_y_continuous has the argument of sec.axis.
1st axis limits are set (0,3,226). 3,226 is the total No. of returns of all categories. labels are specified in numbers with no decimal points since they represent count of frequency. 2nd axis is built on 1st axis limits, divided by 3,226 to make the limits (0,1). labels are specified in percentage with no decimal points. breaks are set between (0,1) with 0.1 interval.
1st y axis is named as “No. of Returns”. 2nd y axis is named as “Cumulative % of Total No. of Returns”.
scale_y_continuous(name = "No. of Returns",
limits = c(0,3226),
breaks = c(seq(0,3226,322.6)),
labels = scales::number_format(accuracy = 1),
sec.axis = sec_axis(trans = ~./3226,
name = "Cumulative % of Total No. of Returns",
labels = scales::percent_format(accuracy = 1), breaks = c(seq(0,1,0.1))))
geom_path connects the observations in the order in which they appear in the data. geom_line connects them in order of the variable on the x axis. Since we have sorted the data beforehand in a descending order, we can use either geom_path or geom_line.
cumfreq/sum(Returns)X3226 are calculated and used as the y variable for the line chart. group = 1 is a “dummy” grouping to override the default group by x variable.
geom_line(aes(y = cumfreq/sum(Returns)*3226, group = 1), size = 1, color = "salmon" )
In order to add a horizontal reference line to highlight the sub-categories that contributes to 80% of total No. of returns, geom_hline is added.
geom_hline(aes(yintercept = 0.8*3226), size = 1, color = "grey", linetype = "dashed")
geom_point is used to create scatterplots. In our case, we add in points to highlight the data points of cumulative % of total No. of returns.
geom_point(aes(y = cumfreq/sum(Returns)*3226))
For better presentation, labels and theme are customized for better presentation.
x axis is labelled as “Sub_Category” by using xlab. Label angle is set at 45 degree for better fit. Title is added and customized with proper size and format by modifying components of theme. Panel border, background and grid lines are also customized by modifying components of theme.
The complete code chunk and output are shown below.
ggplot(data = freq_cum,
aes(x = reorder(`Sub-Category`,-Returns))) +
geom_col(aes(y = Returns), color = 'darkturquoise',fill = 'darkturquoise') +
geom_text(aes(y = Returns, label = Returns), size = 4, position = position_dodge(width = 0.5), vjust = -0.5) +
scale_y_continuous(
name = "No. of Returns",
limits = c(0,3226),
breaks = c(seq(0,3226,322.6)),
labels = scales::number_format(accuracy = 1),
sec.axis = sec_axis(trans = ~./3226,
name = "Cumulative % of Total No. of Returns",
labels = scales::percent_format(accuracy = 1),
breaks = c(seq(0,1,0.1)))) +
geom_line(aes(y = cumfreq/sum(Returns)*3226, group = 1), size = 1, color = "salmon" ) +
geom_hline(aes(yintercept = 0.8*3226), size = 1, color = "grey", linetype = "dashed") +
geom_point(aes(y = cumfreq/sum(Returns)*3226)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Pareto Chart of No. of Returns by Product Category",
subtitle="8 product categories contribute to around 80% of total No. of returns",
x="Sub-category",
caption = "Superstores in United States in 2021") +
theme(plot.title = element_text("Pareto Chart of No. of Returns by Product Category",
size = 15, face = "bold", hjust = 0)) +
theme(panel.border = element_rect(colour="grey60", fill = NA),
panel.background = element_rect(fill = "white"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid',
colour = "grey90"),
panel.grid.minor = element_line(size = 0.2, linetype = 'dashed',
colour = "grey95"))
Population Pyramid or “age-sex pyramid” is a graphical illustration of the distribution of a population by age groups and sex. Males are usually shown on the left and females on the right, and they may be measured in absolute numbers or as a percentage of the total population. The pyramid can be used to visualize the age of a particular population. It is also used in ecology to determine the overall age distribution of a population; an indication of the reproductive capabilities and likelihood of the continuation of a species (Wikipedia). Below chart was created using Tableau.
In this exercise, we are going to create a population pyramid showing the distribution of Singapore 2021 population.
Since we have already installed the relevant packages in Task 1, we directly proceed to the data import.
For this task, the data entitle Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2021 is used. The data set is available at Department of Statistics home page
The code chunk below imports respopagesextod2021.csv into R environment by using read_csv() function of readr package.
popagesex <- read_csv("data/respopagesextod2021.csv")
After parsing the worksheet into R, it is a good practice to check the structure and content of the newly tibble data frames in RStudio by opening the data tab.
We need to sort the values alphanumerically by the age group. The following code chunk changes age group “5_to_9” to “05_to_09” to allow for ggplot default labels to arrange the data according to age group. If not, “5_to_9” would be placed among the 50s group data.
popagesex <- popagesex %>%
mutate(AG = case_when(AG == "0_to_4" ~ "00_to_04",
AG == "5_to_9" ~ "05_to_09",
TRUE ~ AG) )
subset() returns subsets of vectors, matrices or data frames which meet conditions. We use this method to split population data by gender, so that we can plot male data on the negative side and female data on the positive side of the y axis.
There are two types of bar charts: geom_bar and geom_col. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is. We can also use geom_bar() and override the default connection between geom_bar() and stat_count() by changing “stat” argument to “identity”.
The full code chunk and output are shown below.
ggplot(data=popagesex,
aes(x = AG, fill = Sex)) +
geom_bar (data = subset(popagesex, Sex =="Females"), aes(y = Pop), stat = "identity") +
geom_bar (data = subset(popagesex, Sex =="Males"), aes (y = -Pop), stat = "identity") +
xlab("Age") +
scale_y_continuous(name = "Population",
breaks = c(seq(-160000, 160000, 20000)),
labels = paste0(as.character(c(seq(160, 0, -20), seq(20, 160, 20))), "K")) +
coord_flip() +
theme_bw() +
ggtitle("Singapore 2021 Population Pyramid") +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5))
Compared to gglpot2, Tableau is more user-friendly and more efficient.
ggplot2 allows for more customized solutions. Users’ idea can be expressed in greater details and better accuracy. Comparing above 2 pareto charts plotted by Tableau and ggplot2 separately, the gglot2 chart shows not only the No. of returns and cumulative %, but also shows the total no. of returns by changing the axis limit. Captions and subtitles can also be added for detailed illustration.
ggplot2 allows for multiple solutions. The count of frequency step above were proceeded using both the count method and the the group_by method. The process of exploring different possibilities are interesting.
Pareto chart showing the distribution of returns by product sub-category in 2021 for superstores in United States and Singapore 2021 population pyramid using ggplot2.
Pareto chart is a popular data visualization use to reveal something like the 80-20 rule. Below chart was created using Tableau.
The purpose of the Pareto chart is to highlight the most important among a (typically large) set of factors. In market share study, Pareto charts are useful to reveal if the business is dominated by a handful of key products or customers. In general, businesses will try to avoid their businesses over-dependent by a few products or customers because if something happen to their key products or business partners, the businesses will be very badly affected.
In the nutshell, a Pareto chart is a dual axis chart. The y-axis on the left (also known as primary y-axis) shows the absolute frequency of the observations and their values are presented as bars. The y-axis on the right shows the cumulative frequency of the observed values and they are presented as dots. These dots are then joined by a smooth line.
In this exercise, we are going to create a pareto chart showing the distribution of returns by product sub-category by using a data set from a retail store group.
Figure below shows the workflow of a typical data science and analytics process and the tidyverse R packages that specially designed to support the specific tasks in the data science and analytics process.
Before we get started, it is important for us to ensure that the required R packages have been installed. For the purpose of the exercise, the follow tidyverse packages will be used:
The code chunk below is used to check if the necessary R packages are installed in R. If they have yet, then RStudio will install the missing R package(s). If are already been installed, then they will be launch in R environment.
packages =c('tidyverse','readxl', 'knitr')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T )
}
It is important note that readxl package has to be listed separately in the packages list because it is not part of the core tidyverse package.
In this exercise, superstore-2021 data set will be used. It is an MS Excel workbook. It consists of three worksheets, namely: Orders, People and Returns.
The code chunk below imports superstore-2021.xls into R environment by using reaxl.
orders <- read_xls("data/Superstore-2021.xls",
sheet = "Orders")
returns <- read_xls("data/Superstore-2021.xls",
sheet = "Returns")
After parsing the worksheet into R, it is a good practice to check the structure and content of the newly tibble data frames in RStudio by opening the data tab.
In this step, the left_join() of dplyr is used to join the returns data frame and orders data frame by using Order ID as the unique identifier.
joined_tab <- left_join(returns, orders,
by = c('Order ID' = 'Order ID'))
Next, we are going to compute the frequency count of returns by sub-category. There are two ways to complete the task. The first way is by using the group-by method and the second way is by using the count method of dplyr.
In the code chunk below, group_by() of dplyr package is used to group the orders by Sub-Category. Then, summarise() of dplyr is used to count (i.e. n()) the number of returned orders.
freq_returned <- joined_tab %>%
group_by(`Sub-Category`) %>%
summarise('Returns' = n()) %>%
ungroup()
The code chunk below shows the alternative way to derive the frequency of returns by sub-category. In this case, count() of dplyr package is used.
freq_returned <- joined_tab %>%
count(`Sub-Category`) %>%
rename(Returns = n)
By default, count() will return a new field called n to store the result. In view of this, rename() of dplyr is used to rename n to Returns.
Before we can compute the cumulative frequency, we need to sort the values in the sub-category field by the values in the Returns field. To accomplish this task, the arrange() of dplyr package is used as shown in the code chunk below.
freq_sorted <- freq_returned %>%
arrange(desc(Returns))
The sorted data frame should look similar to the screenshot below.
Out next task is to compute the cumulative frequency of returns by product sub-category. This task will be performed by using mutate() of dplyr package and cumsum() of Base R.
freq_cum <- freq_sorted %>%
mutate(cumfreq = cumsum (Returns))
There are two types of bar charts: geom_bar and geom_col. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is. Since we want the heights of the bars to represent values in the data, we choose to use geom_col in this case.
ggplot(data = freq_cum,
aes(x = reorder(`Sub-Category`,-Returns))) +
geom_col(aes(y = Returns), color = 'darkturquoise',fill = 'darkturquoise')
We can also use geom_bar() and override the default connection between geom_bar() and stat_count() by changing “stat” argument with below code:
ggplot(data=freq_cum,
aes(x = reorder(`Sub-Category`,-Returns))) +
geom_bar(aes(y = Returns), stat ="identity", color = 'darkturquoise', fill = 'darkturquoise')
geom_text are useful for labeling plots. geom_text() adds only text to the plot. geom_label() draws a rectangle behind the text. Since we still have cumulative % line and horizontal reference line to be plotted in the same chart, to keep the chart neat, geom_text() is a better choice.
geom_text(aes(y = Returns, label = Returns), size = 4,position = position_dodge(width = 0.5), vjust = -0.25)
In order to show the cumulative % of total No. of returns on the same plot with No.of returns, we need to add in a second y axis. scale_y_continuous has the argument of sec.axis.
1st axis limits are set (0,3,226). 3,226 is the total No. of returns of all categories. labels are specified in numbers with no decimal points since they represent count of frequency. 2nd axis is built on 1st axis limits, divided by 3,226 to make the limits (0,1). labels are specified in percentage with no decimal points. breaks are set between (0,1) with 0.1 interval.
1st y axis is named as “No. of Returns”. 2nd y axis is named as “Cumulative % of Total No. of Returns”.
scale_y_continuous(name = "No. of Returns",
limits = c(0,3226),
breaks = c(seq(0,3226,322.6)),
labels = scales::number_format(accuracy = 1),
sec.axis = sec_axis(trans = ~./3226,
name = "Cumulative % of Total No. of Returns",
labels = scales::percent_format(accuracy = 1), breaks = c(seq(0,1,0.1))))
geom_path connects the observations in the order in which they appear in the data. geom_line connects them in order of the variable on the x axis. Since we have sorted the data beforehand in a descending order, we can use either geom_path or geom_line.
cumfreq/sum(Returns)X3226 are calculated and used as the y variable for the line chart. group = 1 is a “dummy” grouping to override the default group by x variable.
geom_line(aes(y = cumfreq/sum(Returns)*3226, group = 1), size = 1, color = "salmon" )
In order to add a horizontal reference line to highlight the sub-categories that contributes to 80% of total No. of returns, geom_hline is added.
geom_hline(aes(yintercept = 0.8*3226), size = 1, color = "grey", linetype = "dashed")
geom_point is used to create scatterplots. In our case, we add in points to highlight the data points of cumulative % of total No. of returns.
geom_point(aes(y = cumfreq/sum(Returns)*3226))
For better presentation, labels and theme are customized for better presentation.
x axis is labelled as “Sub_Category” by using xlab. Label angle is set at 45 degree for better fit. Title is added and customized with proper size and format by modifying components of theme. Panel border, background and grid lines are also customized by modifying components of theme.
The complete code chunk and output are shown below.
ggplot(data = freq_cum,
aes(x = reorder(`Sub-Category`,-Returns))) +
geom_col(aes(y = Returns), color = 'darkturquoise',fill = 'darkturquoise') +
geom_text(aes(y = Returns, label = Returns), size = 4, position = position_dodge(width = 0.5), vjust = -0.5) +
scale_y_continuous(
name = "No. of Returns",
limits = c(0,3226),
breaks = c(seq(0,3226,322.6)),
labels = scales::number_format(accuracy = 1),
sec.axis = sec_axis(trans = ~./3226,
name = "Cumulative % of Total No. of Returns",
labels = scales::percent_format(accuracy = 1),
breaks = c(seq(0,1,0.1)))) +
geom_line(aes(y = cumfreq/sum(Returns)*3226, group = 1), size = 1, color = "salmon" ) +
geom_hline(aes(yintercept = 0.8*3226), size = 1, color = "grey", linetype = "dashed") +
geom_point(aes(y = cumfreq/sum(Returns)*3226)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Pareto Chart of No. of Returns by Product Category",
subtitle="8 product categories contribute to around 80% of total No. of returns",
x="Sub-category",
caption = "Superstores in United States in 2021") +
theme(plot.title = element_text("Pareto Chart of No. of Returns by Product Category",
size = 15, face = "bold", hjust = 0)) +
theme(panel.border = element_rect(colour="grey60", fill = NA),
panel.background = element_rect(fill = "white"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid',
colour = "grey90"),
panel.grid.minor = element_line(size = 0.2, linetype = 'dashed',
colour = "grey95"))
Population Pyramid or “age-sex pyramid” is a graphical illustration of the distribution of a population by age groups and sex. Males are usually shown on the left and females on the right, and they may be measured in absolute numbers or as a percentage of the total population. The pyramid can be used to visualize the age of a particular population. It is also used in ecology to determine the overall age distribution of a population; an indication of the reproductive capabilities and likelihood of the continuation of a species (Wikipedia). Below chart was created using Tableau.
In this exercise, we are going to create a population pyramid showing the distribution of Singapore 2021 population.
Since we have already installed the relevant packages in Task 1, we directly proceed to the data import.
For this task, the data entitle Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2021 is used. The data set is available at Department of Statistics home page
The code chunk below imports respopagesextod2021.csv into R environment by using read_csv() function of readr package.
popagesex <- read_csv("data/respopagesextod2021.csv")
After parsing the worksheet into R, it is a good practice to check the structure and content of the newly tibble data frames in RStudio by opening the data tab.
We need to sort the values alphanumerically by the age group. The following code chunk changes age group “5_to_9” to “05_to_09” to allow for ggplot default labels to arrange the data according to age group. If not, “5_to_9” would be placed among the 50s group data.
popagesex <- popagesex %>%
mutate(AG = case_when(AG == "0_to_4" ~ "00_to_04",
AG == "5_to_9" ~ "05_to_09",
TRUE ~ AG) )
subset() returns subsets of vectors, matrices or data frames which meet conditions. We use this method to split population data by gender, so that we can plot male data on the negative side and female data on the positive side of the y axis.
There are two types of bar charts: geom_bar and geom_col. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is. We can also use geom_bar() and override the default connection between geom_bar() and stat_count() by changing “stat” argument to “identity”.
The full code chunk and output are shown below.
ggplot(data=popagesex,
aes(x = AG, fill = Sex)) +
geom_bar (data = subset(popagesex, Sex =="Females"), aes(y = Pop), stat = "identity") +
geom_bar (data = subset(popagesex, Sex =="Males"), aes (y = -Pop), stat = "identity") +
xlab("Age") +
scale_y_continuous(name = "Population",
breaks = c(seq(-160000, 160000, 20000)),
labels = paste0(as.character(c(seq(160, 0, -20), seq(20, 160, 20))), "K")) +
coord_flip() +
theme_bw() +
ggtitle("Singapore 2021 Population Pyramid") +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5))
Compared to gglpot2, Tableau is more user-friendly and more efficient.
ggplot2 allows for more customized solutions. Users’ idea can be expressed in greater details and better accuracy. Comparing above 2 pareto charts plotted by Tableau and ggplot2 separately, the gglot2 chart shows not only the No. of returns and cumulative %, but also shows the total no. of returns by changing the axis limit. Captions and subtitles can also be added for detailed illustration.
ggplot2 allows for multiple solutions. The count of frequency step above were proceeded using both the count method and the the group_by method. The process of exploring different possibilities are interesting.
Pareto chart showing the distribution of returns by product sub-category in 2021 for superstores in United States and Singapore 2021 population pyramid using ggplot2.
Pareto chart is a popular data visualization use to reveal something like the 80-20 rule. Below chart was created using Tableau.
The purpose of the Pareto chart is to highlight the most important among a (typically large) set of factors. In market share study, Pareto charts are useful to reveal if the business is dominated by a handful of key products or customers. In general, businesses will try to avoid their businesses over-dependent by a few products or customers because if something happen to their key products or business partners, the businesses will be very badly affected.
In the nutshell, a Pareto chart is a dual axis chart. The y-axis on the left (also known as primary y-axis) shows the absolute frequency of the observations and their values are presented as bars. The y-axis on the right shows the cumulative frequency of the observed values and they are presented as dots. These dots are then joined by a smooth line.
In this exercise, we are going to create a pareto chart showing the distribution of returns by product sub-category by using a data set from a retail store group.
Figure below shows the workflow of a typical data science and analytics process and the tidyverse R packages that specially designed to support the specific tasks in the data science and analytics process.
Before we get started, it is important for us to ensure that the required R packages have been installed. For the purpose of the exercise, the follow tidyverse packages will be used:
The code chunk below is used to check if the necessary R packages are installed in R. If they have yet, then RStudio will install the missing R package(s). If are already been installed, then they will be launch in R environment.
packages =c('tidyverse','readxl', 'knitr')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T )
}
It is important note that readxl package has to be listed separately in the packages list because it is not part of the core tidyverse package.
In this exercise, superstore-2021 data set will be used. It is an MS Excel workbook. It consists of three worksheets, namely: Orders, People and Returns.
The code chunk below imports superstore-2021.xls into R environment by using reaxl.
orders <- read_xls("data/Superstore-2021.xls",
sheet = "Orders")
returns <- read_xls("data/Superstore-2021.xls",
sheet = "Returns")
After parsing the worksheet into R, it is a good practice to check the structure and content of the newly tibble data frames in RStudio by opening the data tab.
In this step, the left_join() of dplyr is used to join the returns data frame and orders data frame by using Order ID as the unique identifier.
joined_tab <- left_join(returns, orders,
by = c('Order ID' = 'Order ID'))
Next, we are going to compute the frequency count of returns by sub-category. There are two ways to complete the task. The first way is by using the group-by method and the second way is by using the count method of dplyr.
In the code chunk below, group_by() of dplyr package is used to group the orders by Sub-Category. Then, summarise() of dplyr is used to count (i.e. n()) the number of returned orders.
freq_returned <- joined_tab %>%
group_by(`Sub-Category`) %>%
summarise('Returns' = n()) %>%
ungroup()
The code chunk below shows the alternative way to derive the frequency of returns by sub-category. In this case, count() of dplyr package is used.
freq_returned <- joined_tab %>%
count(`Sub-Category`) %>%
rename(Returns = n)
By default, count() will return a new field called n to store the result. In view of this, rename() of dplyr is used to rename n to Returns.
Before we can compute the cumulative frequency, we need to sort the values in the sub-category field by the values in the Returns field. To accomplish this task, the arrange() of dplyr package is used as shown in the code chunk below.
freq_sorted <- freq_returned %>%
arrange(desc(Returns))
The sorted data frame should look similar to the screenshot below.
Out next task is to compute the cumulative frequency of returns by product sub-category. This task will be performed by using mutate() of dplyr package and cumsum() of Base R.
freq_cum <- freq_sorted %>%
mutate(cumfreq = cumsum (Returns))
There are two types of bar charts: geom_bar and geom_col. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is. Since we want the heights of the bars to represent values in the data, we choose to use geom_col in this case.
ggplot(data = freq_cum,
aes(x = reorder(`Sub-Category`,-Returns))) +
geom_col(aes(y = Returns), color = 'darkturquoise',fill = 'darkturquoise')
We can also use geom_bar() and override the default connection between geom_bar() and stat_count() by changing “stat” argument with below code:
ggplot(data=freq_cum,
aes(x = reorder(`Sub-Category`,-Returns))) +
geom_bar(aes(y = Returns), stat ="identity", color = 'darkturquoise', fill = 'darkturquoise')
geom_text are useful for labeling plots. geom_text() adds only text to the plot. geom_label() draws a rectangle behind the text. Since we still have cumulative % line and horizontal reference line to be plotted in the same chart, to keep the chart neat, geom_text() is a better choice.
geom_text(aes(y = Returns, label = Returns), size = 4,position = position_dodge(width = 0.5), vjust = -0.25)
In order to show the cumulative % of total No. of returns on the same plot with No.of returns, we need to add in a second y axis. scale_y_continuous has the argument of sec.axis.
1st axis limits are set (0,3,226). 3,226 is the total No. of returns of all categories. labels are specified in numbers with no decimal points since they represent count of frequency. 2nd axis is built on 1st axis limits, divided by 3,226 to make the limits (0,1). labels are specified in percentage with no decimal points. breaks are set between (0,1) with 0.1 interval.
1st y axis is named as “No. of Returns”. 2nd y axis is named as “Cumulative % of Total No. of Returns”.
scale_y_continuous(name = "No. of Returns",
limits = c(0,3226),
breaks = c(seq(0,3226,322.6)),
labels = scales::number_format(accuracy = 1),
sec.axis = sec_axis(trans = ~./3226,
name = "Cumulative % of Total No. of Returns",
labels = scales::percent_format(accuracy = 1), breaks = c(seq(0,1,0.1))))
geom_path connects the observations in the order in which they appear in the data. geom_line connects them in order of the variable on the x axis. Since we have sorted the data beforehand in a descending order, we can use either geom_path or geom_line.
cumfreq/sum(Returns)X3226 are calculated and used as the y variable for the line chart. group = 1 is a “dummy” grouping to override the default group by x variable.
geom_line(aes(y = cumfreq/sum(Returns)*3226, group = 1), size = 1, color = "salmon" )
In order to add a horizontal reference line to highlight the sub-categories that contributes to 80% of total No. of returns, geom_hline is added.
geom_hline(aes(yintercept = 0.8*3226), size = 1, color = "grey", linetype = "dashed")
geom_point is used to create scatterplots. In our case, we add in points to highlight the data points of cumulative % of total No. of returns.
geom_point(aes(y = cumfreq/sum(Returns)*3226))
For better presentation, labels and theme are customized for better presentation.
x axis is labelled as “Sub_Category” by using xlab. Label angle is set at 45 degree for better fit. Title is added and customized with proper size and format by modifying components of theme. Panel border, background and grid lines are also customized by modifying components of theme.
The complete code chunk and output are shown below.
ggplot(data = freq_cum,
aes(x = reorder(`Sub-Category`,-Returns))) +
geom_col(aes(y = Returns), color = 'darkturquoise',fill = 'darkturquoise') +
geom_text(aes(y = Returns, label = Returns), size = 4, position = position_dodge(width = 0.5), vjust = -0.5) +
scale_y_continuous(
name = "No. of Returns",
limits = c(0,3226),
breaks = c(seq(0,3226,322.6)),
labels = scales::number_format(accuracy = 1),
sec.axis = sec_axis(trans = ~./3226,
name = "Cumulative % of Total No. of Returns",
labels = scales::percent_format(accuracy = 1),
breaks = c(seq(0,1,0.1)))) +
geom_line(aes(y = cumfreq/sum(Returns)*3226, group = 1), size = 1, color = "salmon" ) +
geom_hline(aes(yintercept = 0.8*3226), size = 1, color = "grey", linetype = "dashed") +
geom_point(aes(y = cumfreq/sum(Returns)*3226)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Pareto Chart of No. of Returns by Product Category",
subtitle="8 product categories contribute to around 80% of total No. of returns",
x="Sub-category",
caption = "Superstores in United States in 2021") +
theme(plot.title = element_text("Pareto Chart of No. of Returns by Product Category",
size = 15, face = "bold", hjust = 0)) +
theme(panel.border = element_rect(colour="grey60", fill = NA),
panel.background = element_rect(fill = "white"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid',
colour = "grey90"),
panel.grid.minor = element_line(size = 0.2, linetype = 'dashed',
colour = "grey95"))
Population Pyramid or “age-sex pyramid” is a graphical illustration of the distribution of a population by age groups and sex. Males are usually shown on the left and females on the right, and they may be measured in absolute numbers or as a percentage of the total population. The pyramid can be used to visualize the age of a particular population. It is also used in ecology to determine the overall age distribution of a population; an indication of the reproductive capabilities and likelihood of the continuation of a species (Wikipedia). Below chart was created using Tableau.
In this exercise, we are going to create a population pyramid showing the distribution of Singapore 2021 population.
Since we have already installed the relevant packages in Task 1, we directly proceed to the data import.
For this task, the data entitle Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2021 is used. The data set is available at Department of Statistics home page
The code chunk below imports respopagesextod2021.csv into R environment by using read_csv() function of readr package.
popagesex <- read_csv("data/respopagesextod2021.csv")
After parsing the worksheet into R, it is a good practice to check the structure and content of the newly tibble data frames in RStudio by opening the data tab.
We need to sort the values alphanumerically by the age group. The following code chunk changes age group “5_to_9” to “05_to_09” to allow for ggplot default labels to arrange the data according to age group. If not, “5_to_9” would be placed among the 50s group data.
popagesex <- popagesex %>%
mutate(AG = case_when(AG == "0_to_4" ~ "00_to_04",
AG == "5_to_9" ~ "05_to_09",
TRUE ~ AG) )
subset() returns subsets of vectors, matrices or data frames which meet conditions. We use this method to split population data by gender, so that we can plot male data on the negative side and female data on the positive side of the y axis.
There are two types of bar charts: geom_bar and geom_col. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is. We can also use geom_bar() and override the default connection between geom_bar() and stat_count() by changing “stat” argument to “identity”.
The full code chunk and output are shown below.
ggplot(data=popagesex,
aes(x = AG, fill = Sex)) +
geom_bar (data = subset(popagesex, Sex =="Females"), aes(y = Pop), stat = "identity") +
geom_bar (data = subset(popagesex, Sex =="Males"), aes (y = -Pop), stat = "identity") +
xlab("Age") +
scale_y_continuous(name = "Population",
breaks = c(seq(-160000, 160000, 20000)),
labels = paste0(as.character(c(seq(160, 0, -20), seq(20, 160, 20))), "K")) +
coord_flip() +
theme_bw() +
ggtitle("Singapore 2021 Population Pyramid") +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5))
Compared to gglpot2, Tableau is more user-friendly and more efficient.
ggplot2 allows for more customized solutions. Users’ idea can be expressed in greater details and better accuracy. Comparing above 2 pareto charts plotted by Tableau and ggplot2 separately, the gglot2 chart shows not only the No. of returns and cumulative %, but also shows the total no. of returns by changing the axis limit. Captions and subtitles can also be added for detailed illustration.
ggplot2 allows for multiple solutions. The count of frequency step above were proceeded using both the count method and the the group_by method. The process of exploring different possibilities are interesting.